DATA1220-55, Fall 2024
2024-09-11
Numerical variables can be continuous or discrete.
The “shape” of numerical data is called its distribution.
Location: the “center” of the data
Scale: the “spread” of the data
Commonly observed patterns in numerical distributions
The location of a numerical variable’s distribution can be thought of as the “center” of the data, around which the bulk of the observations cluster.
Mean: the sum of all values divided by the number of observations (i.e., the “average”)
Median: the value in the exact middle of the data
Mode: the most common value in the data (for discrete variables)
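As a minimal sketch of the three measures of location above, the following uses Python's standard `statistics` module. The `pets` data are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample: number of pets owned by nine students.
pets = [0, 1, 1, 1, 2, 2, 3, 4, 13]

mean_pets = statistics.mean(pets)      # sum of all values / number of observations
median_pets = statistics.median(pets)  # middle value once the data are sorted
mode_pets = statistics.mode(pets)      # most common value

print(mean_pets, median_pets, mode_pets)
```

The single large value (13) pulls the mean above the median, while the mode simply reports the most frequent count.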
How far is each data value from the mean?
Variance: \(s^2\), the sum of the squared differences between each observation’s value and the sample mean \(\bar{x}\) divided by \(n-1\)
Standard deviation: \(s\), the square root of the variance
Range: minimum to maximum
Interquartile Range (IQR): 25th percentile to 75th percentile, the middle 50% of the data
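The variance and standard deviation definitions above can be sketched directly from the formula and cross-checked against Python's `statistics` module. The `data` values are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 8]  # hypothetical observations

n = len(data)
x_bar = sum(data) / n                                # sample mean
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # sample variance, divides by n - 1
s = math.sqrt(s2)                                    # standard deviation
data_range = (min(data), max(data))                  # range: minimum to maximum

# The library functions use the same n - 1 definition.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))

print(s2, round(s, 3), data_range)
```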
The median and interquartile range are considered to be robust statistics for the numerical summary of data because they are less sensitive to skew and outliers than the mean, variance, and standard deviation.
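To illustrate what "robust" means here, the sketch below adds one extreme value to a small hypothetical sample and compares how far the mean and the median move.

```python
import statistics

scores = [70, 72, 75, 78, 80]   # hypothetical exam scores
with_outlier = scores + [200]   # the same scores plus one extreme value

# How much does each statistic change when the outlier is added?
mean_shift = statistics.mean(with_outlier) - statistics.mean(scores)
median_shift = statistics.median(with_outlier) - statistics.median(scores)

print(f"mean moved by {mean_shift:.2f}, median moved by {median_shift:.2f}")
```

In this example the mean moves by more than 20 points while the median moves by 1.5.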
The presence of outliers and/or skew in a numerical variable’s distribution affects how well summary statistics describe a distribution’s location.
Minimum value
1st quartile (Q1, 25th percentile)
Median (Q2, 50th percentile)
3rd quartile (Q3, 75th percentile)
Maximum value
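A 5-number summary like the one listed above can be sketched with `statistics.quantiles`. The helper name `five_number_summary` and the `waits` data are hypothetical. Quartile conventions differ between tools; `method="inclusive"` is an assumption made here because it interpolates the same way as R's default `quantile()` type.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

waits = [3, 5, 7, 8, 9, 11, 13, 15, 40]  # hypothetical wait times in minutes
summary = five_number_summary(waits)
print(summary)
```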
The mean and standard deviation are often misused: they are really only appropriate for a certain type of unimodal, symmetric distribution called the normal distribution
Most real-world data will be best described by the median and interquartile range as part of a 5-number summary
Normal distributions are unimodal and symmetric
The mean and the median of normally distributed data will be approximately equal
Normally distributed variables are desirable in statistics but rare in practice
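The claim that the mean and median nearly coincide for normal data can be checked with a small simulation. This is a sketch on simulated (not real) data; the seed, the sample size, and the choice of an exponential distribution as the skewed comparison are all assumptions.

```python
import random
import statistics

rng = random.Random(1220)  # fixed seed so the sketch is repeatable

# Simulated data: 10,000 draws from each distribution.
normal_sample = [rng.gauss(0, 1) for _ in range(10_000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(10_000)]  # right-skewed

# Gap between mean and median: near zero for symmetric data, positive for right skew.
normal_gap = statistics.mean(normal_sample) - statistics.median(normal_sample)
skewed_gap = statistics.mean(skewed_sample) - statistics.median(skewed_sample)

print(f"normal: mean - median = {normal_gap:.3f}")
print(f"skewed: mean - median = {skewed_gap:.3f}")
```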
All of these distributions have a mean of 0 and a standard deviation of 1, but those metrics are only appropriate for describing the middle distribution.
Dot plot
Histogram
Density Curve
Boxplot
Violin plot
QQ plot
There is a single axis (x) along with a dot marking each data point. The points are usually slightly transparent, so you can see when points are overlapping.
In a stacked dot plot, multiple observations at a single value are stacked on top of each other.
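The stacking idea can be sketched as a text-only plot. The function name `stacked_dot_plot` is hypothetical, and the sketch assumes single-digit, non-negative integer values so that each value occupies one character column.

```python
from collections import Counter

def stacked_dot_plot(values):
    """Draw a text dot plot for single-digit, non-negative integers."""
    counts = Counter(values)                 # number of observations at each value
    lo, hi = min(values), max(values)
    tallest = max(counts.values())
    rows = []
    # Build the plot from the top row down; a dot appears wherever the
    # stack at that value is at least `level` observations tall.
    for level in range(tallest, 0, -1):
        row = "".join("o" if counts[v] >= level else " " for v in range(lo, hi + 1))
        rows.append(row.rstrip())
    rows.append("".join(str(v) for v in range(lo, hi + 1)))  # x-axis labels
    return "\n".join(rows)

print(stacked_dot_plot([1, 2, 2, 3, 3, 3, 5]))
```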
A boxplot is a visual representation of a 5-number summary. The “box” represents the middle 50% of the data, or the interquartile range. The line inside the box indicates the median or 50th percentile. The whiskers, the lines coming out from the box, extend to the most extreme observations that lie within 1.5 x IQR of Q1 and Q3. Values beyond that range are classified as potential outliers and appear as points.
Examples of the different distribution shapes as histograms
When histograms are skewed, the mean and the median may occur in 2 different bins.
Outliers are easy to spot on a histogram
Modality is easy to spot on a histogram.
Bins that are too narrow may produce gaps. Bins that are too wide can hide the “shape” of the distribution.
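The effect of bin width can be sketched by counting the same data with three different widths. The helper `bin_counts` and the two-cluster `data` are hypothetical; the helper assumes every value lies between `start` and `stop`.

```python
import math

def bin_counts(values, bin_width, start, stop):
    """Count observations in equal-width bins covering [start, stop]."""
    n_bins = math.ceil((stop - start) / bin_width)
    counts = [0] * n_bins
    for v in values:
        # Bins are left-closed; a value equal to `stop` goes in the last bin.
        idx = min(int((v - start) // bin_width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical bimodal data: one cluster near 3, another near 10.
data = [1, 2, 2, 3, 3, 3, 4, 8, 9, 9, 10, 10, 10, 11]

narrow = bin_counts(data, 0.5, 0, 12)   # many empty bins (gaps)
medium = bin_counts(data, 4, 0, 12)     # two peaks remain visible
wide = bin_counts(data, 12, 0, 12)      # one bin hides the shape entirely

print(narrow)
print(medium)
print(wide)
```

With the widest bins, all 14 observations land in a single bar and the two clusters disappear; with the narrowest, every other bin is empty.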
Density plots produce a smooth curve of the distribution across all values of the numerical variable. While a histogram represents the count of observations that fall within a particular range, the height of a density curve represents how concentrated the observations are around that value of the variable; the area under the curve over a range gives the proportion of observations in that range.
It is easy to spot modality, skew, and outliers on a density plot.
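Density curves are commonly built with kernel density estimation, in which each observation contributes a small bell curve and the contributions are averaged. The sketch below is one hand-rolled version, not the method of any particular plotting library. The function names, the data, and the bandwidth of 1.0 are assumptions.

```python
import math

def kernel_density(values, bandwidth):
    """Return a function giving a Gaussian kernel density estimate at x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Average of one normal curve centered on each observation.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

# Hypothetical bimodal data: one cluster near 3, another near 10.
data = [1, 2, 2, 3, 3, 3, 4, 8, 9, 9, 10, 10, 10, 11]
density = kernel_density(data, bandwidth=1.0)

def area(lo, hi, step=0.01):
    """Approximate the area under the density curve with a midpoint sum."""
    n_steps = int(round((hi - lo) / step))
    return sum(density(lo + (i + 0.5) * step) for i in range(n_steps)) * step

total_area = area(-10, 25)  # the whole curve: should be close to 1
share_low = area(0, 6)      # share of the distribution around the first cluster

print(round(total_area, 3), round(share_low, 3))
```

The total area is approximately 1, so areas over a range can be read as proportions, whereas the curve's height at a single point is not itself a percentage.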
A histogram with a density curve overlaid, a violin plot, and a boxplot for the same distribution
A boxplot is a visual representation of a numerical summary. It shows the median, interquartile range, range, and if outliers are present.
The whiskers of a boxplot (the lines extending out from the box) reach at most 1.5 times the interquartile range beyond the box
Lower whisker limit: Q1 - 1.5 x IQR
Upper whisker limit: Q3 + 1.5 x IQR
If a point is outside this range, it is considered to be a potential outlier
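The 1.5 x IQR rule above can be sketched as follows. The helper name `whisker_limits` and the `waits` data are hypothetical, and `method="inclusive"` is an assumed quartile convention; other tools may compute slightly different quartiles.

```python
import statistics

def whisker_limits(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR), the limits used to flag outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

waits = [3, 5, 7, 8, 9, 11, 13, 15, 40]  # hypothetical wait times in minutes
lower, upper = whisker_limits(waits)

# Points outside the limits are potential outliers.
outliers = [v for v in waits if v < lower or v > upper]

# Whiskers are drawn to the most extreme observations inside the limits.
whisker_ends = (
    min(v for v in waits if v >= lower),
    max(v for v in waits if v <= upper),
)

print(lower, upper, outliers, whisker_ends)
```

Here Q1 = 7 and Q3 = 13, so the limits are -2 and 22; the value 40 is flagged as a potential outlier and the whiskers stop at 3 and 15.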
Because it’s hard to spot modality in a box plot, they are often combined with density curves or violin plots.
Some visualizations add a point to the boxplot indicating the location of the mean. If the mean is meaningfully different from the median, the distribution likely has outliers and/or skew.
Raincloud plots combine density curves, boxplots, and stacked dot plots. Can you see why having all 3 improves your understanding of the data?
What is the modality of the distribution?
Is the distribution skewed or symmetric?
Are there any outliers?
What are the appropriate summary statistics for a distribution with this shape?
What is the modality of each distribution? Are they skewed? Would the mean be greater than, less than, or about equal to the median? Are there any outliers?
datasets::iris data set
Describe the shape of these different distributions. Do any of them look normally distributed?
datasets::iris data set
When a distribution has multiple modes or is unusually distributed, it may be better to visualize the data separated by a categorical variable.
datasets::iris data set
What type of special distribution is this? What summary statistics best describe this type of distribution?
Analyze contingency (e.g. 2x2) tables
Summarizing categorical variables with proportions
Comparison of numerical data between categorical groups
Recognize common visualization techniques / plots
Numerical: Dot plots, histograms, density plots, QQ plots, box plots, violin plots
Categorical: bar plots, mosaic plots, tree map
Build basic visualizations in R using ggplot2
Data visualization do’s and don’ts
DATA1220-55 Fall 2024, Class 06 | Updated: 2024-09-11